
91 research outputs found

Effective Spoken Language Labeling with Deep Recurrent Neural Networks

Author: Dinarelli Marco
Dupont Yoann
Tellier Isabelle
Publication date: 20/06/2017

Understanding spoken language is a highly complex problem that can be decomposed into several simpler tasks. In this paper, we focus on Spoken Language Understanding (SLU), the module of spoken dialog systems responsible for extracting a semantic interpretation from the user utterance. The task is treated as a labeling problem. In the past, SLU has been performed with a wide variety of probabilistic models. The rise of neural networks in the last couple of years has opened interesting new research directions in this domain. Recurrent Neural Networks (RNNs) in particular can not only represent several pieces of information as embeddings but also, thanks to their recurrent architecture, encode relatively long contexts as embeddings. Such long contexts are generally out of reach for the models previously used for SLU. In this paper we propose novel RNN architectures for SLU that outperform previous ones. Starting from a published idea as the base block, we design new deep RNNs achieving state-of-the-art results on two widely used SLU corpora: ATIS (Air Travel Information System), in English, and MEDIA (hotel information and reservation in France), in French.

Comment: 8 pages. Rejected from IJCAI 2017; good remarks overall, but judged slightly off-topic in the global meta-reviews. Recommendations: 8, 6, 6, 4. arXiv admin note: text overlap with arXiv:1706.0174

arXiv.org e-Print Archive

Establishing a New State-of-the-Art for French Named Entity Recognition

Author: Dupont Yoann
Muller Benjamin
Romary Laurent
Sagot Benoît
Suárez Pedro Javier Ortiz
Publication date: 11/05/2020

The French TreeBank, developed at the University Paris 7, is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information about named entities, which are among the most useful pieces of information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contains referential information, which complements the type and span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and provide a few figures about the resulting annotations.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

Un corpus libre, évolutif et versionné en entités nommées du français

Author: Dupont Yoann
Publication venue: HAL CCSD
Publication date: 01/07/2019

English title: A free, evolving and versioned French named entity recognition corpus.

Annotated corpora are hard to create because of the high human cost involved. Once released, they are difficult to modify and tend not to evolve over time. In this article we present a free and evolving corpus annotated for named entity recognition, based on French Wikinews articles from 2016 to 2018, for a total of 1191 annotated articles. We briefly describe the annotation guidelines before comparing our corpus to various corpora of comparable nature. We also report an intra-annotator agreement to estimate the stability of the annotation, and describe the overall process for continuing to enrich the corpus.

INRIA a CCSD electronic archive server

Adapt a Text-Oriented Chunker for Oral Data: How Much Manual Effort is Necessary?

Author: Dupont Yoann
Eshkol Iris
Tellier Isabelle
Wang Ilaine
Publication venue: HAL CCSD
Publication date: 20/10/2013

In this paper, we try three distinct approaches to chunking transcribed oral data with labeling tools learned from a corpus of written texts. The purpose is to reach the best possible results with the least possible manual correction or re-learning effort.

HAL Université de Tours

Peut-on bien chunker avec de mauvaises étiquettes POS ?

Author: Dupont Yoann
Eshkol-Taravella Iris
Tellier Isabelle
Wang Ilaine
Publication venue: HAL CCSD
Publication date: 02/07/2014

http://www.taln2014.org/site/actes-en-ligne/actes-en-ligne-articles-taln/

In this paper, we test two distinct approaches to chunking transcribed oral data, trying to minimize the phases of manual correction. First, we reuse an existing chunker learned from written texts; then we try to learn a new chunker specific to oral data from a small amount of manually annotated and corrected data. The purpose is to reach the best possible results for the chunker with as few manual corrections of the POS labels as possible. Our experiments show that it is possible to learn a new, effective chunker for oral data from a small labeled reference corpus, without any manual correction of the POS labels.

HAL Université de Tours

Intégrer des connaissances linguistiques dans un CRF : application à l'apprentissage d'un segmenteur-étiqueteur du français

Author: Billot Sylvie
Constant Mathieu
Duchier Denys
Dupont Yoann
Sigogne Anthony
Tellier Isabelle
Publication venue: HAL CCSD
Publication date: 27/06/2011

In this article, we synthesize the results of several series of experiments carried out with linear-chain CRFs (Conditional Random Fields) to learn to annotate French texts from examples, exploiting various external linguistic resources. These experiments addressed part-of-speech tagging that integrates the identification of multiword units. We show that the CRF model can integrate lexical resources rich in multiword units in different ways, and thereby reaches the best current tagging accuracy for French.

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

RETRATO DE MILITARES [Material gráfico]

Author: Delaborde Marine
Dupont Yoann
Grobol Loïc
Landragin Frédéric
Publication date: 14/05/2018

Digital copy. Madrid: Ministerio de Educación, Cultura y Deporte, 201

Biblioteca Virtual del Patrimonio Bibliográfico (Virtual Library of Bibliographical Heritage)

INRIA a CCSD electronic archive server

French CrowS-Pairs: Extension à une langue autre que l'anglais d'un corpus de mesure des biais sociétaux dans les modèles de langue masqués

Author: Bezançon Julien
Dupont Yoann
Fort Karën
Névéol Aurélie
Publication venue: HAL CCSD
Publication date: 27/06/2022

To widen the scope of bias studies in natural language processing beyond American English, we introduce material for measuring social bias in language models against demographic groups in France. We extend the CrowS-pairs dataset with 1,677 sentence pairs in French that cover stereotypes in ten types of bias, such as gender. Of these, 1,467 sentence pairs are translated from CrowS-pairs and 210 are newly crowdsourced and translated back into English. Following the minimal-pair principle, each pair contrasts a stereotyped statement about a disadvantaged group with the same sentence about an advantaged group. We find that four widely used language models favor sentences that express stereotypes in most bias categories. We report on the translation process and offer guidelines for extending the dataset to other languages. Warning: this article contains statements of stereotypes that may be offensive.

INRIA a CCSD electronic archive server

Description et modélisation des chaînes de référence. Le projet ANR Democrat (2016-2020) et ses avancées à mi-parcours

Author: Delaborde Marine
Dupont Yoann
Grobol Loïc
Landragin Frédéric
Publication venue: HAL CCSD
Publication date: 14/05/2018

The ANR Democrat project aims to advance research on the French language and its textual structure through the detailed, contrastive analysis of reference chains (successive instantiations of the same entity) in a diachronic corpus of texts written between the 9th and the 21st century, covering varied textual genres. It brings together researchers from the Lattice, LiLPa, ICAR and IHRIM laboratories. It was launched in March 2016, and most of the effort currently goes into the (manual) annotation of a corpus. Several annotation experiments have been carried out in order to test different procedures. The procedure adopted alternates manual phases with automatic phases that complete the annotations by running scripts.

INRIA a CCSD electronic archive server

FENEC : un corpus à échantillons équilibrés pour l'évaluation des entités nommées en français

Author: Dupont Yoann
Fort Karën
Jouglar Alexane
Millour Alice
Publication venue: HAL CCSD
Publication date: 27/06/2022

We present FENEC (FrEnch Named-entity Evaluation Corpus), a balanced sample corpus containing six genres and annotated with named entities according to Quæro, a fine-grained annotation scheme. The characteristics of this corpus allow us to evaluate and compare three automatic annotation tools (one rule-based and two neural network-based) by varying three dimensions of the evaluation: the granularity of the label set, the genre of the corpora, and the evaluation metrics.

INRIA a CCSD electronic archive server